Abstract

We describe a simple neural language model that relies only on
character-level inputs. Predictions are still made at the word level.
Our model employs a convolutional neural network (CNN) and a highway
network over characters, whose output is given to a long short-term
memory (LSTM) recurrent neural network language model (RNN-LM). On the
English Penn Treebank the model is on par with the existing
state-of-the-art despite having 60% fewer parameters. On languages with
rich morphology (Arabic, Czech, French, German, Spanish, Russian), the
model outperforms word-level/morpheme-level LSTM baselines, again with
fewer parameters. The results suggest that on many languages, character
inputs are sufficient for language modeling. Analysis of word
representations obtained from the character composition part of the
model reveals that the model is able to encode, from characters only,
both semantic and orthographic information.

Introduction 

Language modeling is a fundamental task in artificial intelligence and
natural language processing (NLP), with applications in speech
recognition, text generation, and machine translation. A language model
is formalized as a probability distribution over a sequence of strings
(words), and traditional methods usually involve making an n-th order
Markov assumption and estimating n-gram probabilities via counting and
subsequent smoothing (Chen and Goodman 1998). The count-based models are
simple to train, but probabilities of rare n-grams can be poorly
estimated due to data sparsity (despite smoothing techniques).

Neural Language Models (NLM) address the n-gram data sparsity issue
through parameterization of words as vectors (word embeddings) and using
them as inputs to a neural network (Bengio, Ducharme, and Vincent 2003;
Mikolov et al. 2010). The parameters are learned as part of the training
process. Word embeddings obtained through NLMs exhibit the property
whereby semantically close words are likewise close in the induced
vector space (as is the case with non-neural techniques such as Latent
Semantic Analysis (Deerwester, Dumais, and Harshman 1990)).

While NLMs have been shown to outperform count-based n-gram language
models (Mikolov et al. 2011), they are blind to subword information
(e.g. morphemes). For example, they do not know, a priori, that
eventful, eventfully, uneventful, and uneventfully should have
structurally related embeddings in the vector space. Embeddings of rare
words can thus be poorly estimated, leading to high perplexities for
rare words (and words surrounding them). This is especially problematic
in morphologically rich languages with long-tailed frequency
distributions or domains with dynamic vocabularies (e.g. social media).

In this work, we propose a language model that leverages subword
information through a character-level convolutional neural network
(CNN), whose output is used as an input to a recurrent neural network
language model (RNN-LM). Unlike previous works that utilize subword
information via morphemes (Botha and Blunsom 2014; Luong, Socher, and
Manning 2013), our model does not require morphological tagging as a
pre-processing step. And, unlike the recent line of work which combines
input word embeddings with features from a character-level model (dos
Santos and Zadrozny 2014; dos Santos and Guimaraes 2015), our model does
not utilize word embeddings at all in the input layer. Given that most
of the parameters in NLMs are from the word embeddings, the proposed
model has significantly fewer parameters than previous NLMs, making it
attractive for applications where model size may be an issue (e.g. cell
phones).

To summarize, our contributions are as follows: 

• on English, we achieve results on par with the existing
state-of-the-art on the Penn Treebank (PTB), despite having
approximately 60% fewer parameters, and

• on morphologically rich languages (Arabic, Czech, French, German,
Spanish, and Russian), our model outperforms various baselines
(Kneser-Ney, word-level/morpheme-level LSTM), again with fewer
parameters.

We have released all the code for the models described in this paper.¹

¹ https://github.com/yoonkim/lstm-char-cnn



Model 

The architecture of our model, shown in Figure 1, is straightforward.
Whereas a conventional NLM takes word embeddings as inputs, our model
instead takes the output from a single-layer character-level
convolutional neural network with max-over-time pooling.

For notation, we denote vectors with bold lower-case (e.g.
$\mathbf{x}_t$, $\mathbf{b}$), matrices with bold upper-case (e.g.
$\mathbf{W}$, $\mathbf{U}^o$), scalars with italic lower-case (e.g.
$x$, $b$), and sets with cursive upper-case (e.g. $\mathcal{V}$,
$\mathcal{C}$) letters. For notational convenience we assume that words
and characters have already been converted into indices.

Recurrent Neural Network 

A recurrent neural network (RNN) is a type of neural network
architecture particularly suited for modeling sequential phenomena. At
each time step $t$, an RNN takes the input vector
$\mathbf{x}_t \in \mathbb{R}^n$ and the hidden state vector
$\mathbf{h}_{t-1} \in \mathbb{R}^m$ and produces the next hidden state
$\mathbf{h}_t$ by applying the following recursive operation:

$$\mathbf{h}_t = f(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}) \qquad (1)$$

Here $\mathbf{W} \in \mathbb{R}^{m \times n}$,
$\mathbf{U} \in \mathbb{R}^{m \times m}$, $\mathbf{b} \in \mathbb{R}^m$
are parameters of an affine transformation and $f$ is an element-wise
nonlinearity. In theory the RNN can summarize all historical information
up to time $t$ with the hidden state $\mathbf{h}_t$. In practice
however, learning long-range dependencies with a vanilla RNN is
difficult due to vanishing/exploding gradients (Bengio, Simard, and
Frasconi 1994), which occurs as a result of the Jacobian's
multiplicativity with respect to time.
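As a minimal sketch of the recurrence in Eq. (1), the step below applies the affine transformation followed by the nonlinearity. The dimensions, the random initialization, and the choice of tanh for $f$ are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b, f=np.tanh):
    """One step of Eq. (1): h_t = f(W x_t + U h_{t-1} + b)."""
    return f(W @ x_t + U @ h_prev + b)

# Hypothetical sizes: input dimension n = 3, hidden dimension m = 2.
rng = np.random.default_rng(0)
n, m = 3, 2
W = rng.uniform(-0.05, 0.05, size=(m, n))
U = rng.uniform(-0.05, 0.05, size=(m, m))
b = np.zeros(m)

h = np.zeros(m)                       # initial hidden state
for x_t in rng.normal(size=(4, n)):   # a toy sequence of four inputs
    h = rnn_step(x_t, h, W, U, b)
```

The same hidden state `h` is threaded through every step, which is what lets it summarize the history up to time $t$.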

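Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997)
addresses the problem of learning long range dependencies by augmenting
the RNN with a memory cell vector $\mathbf{c}_t \in \mathbb{R}^n$ at
each time step. Concretely, one step of an LSTM takes as input
$\mathbf{x}_t$, $\mathbf{h}_{t-1}$, $\mathbf{c}_{t-1}$ and produces
$\mathbf{h}_t$, $\mathbf{c}_t$ via the following intermediate
calculations:

$$
\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}^i\mathbf{x}_t + \mathbf{U}^i\mathbf{h}_{t-1} + \mathbf{b}^i) \\
\mathbf{f}_t &= \sigma(\mathbf{W}^f\mathbf{x}_t + \mathbf{U}^f\mathbf{h}_{t-1} + \mathbf{b}^f) \\
\mathbf{o}_t &= \sigma(\mathbf{W}^o\mathbf{x}_t + \mathbf{U}^o\mathbf{h}_{t-1} + \mathbf{b}^o) \\
\mathbf{g}_t &= \tanh(\mathbf{W}^g\mathbf{x}_t + \mathbf{U}^g\mathbf{h}_{t-1} + \mathbf{b}^g) \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \\
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t)
\end{aligned}
\qquad (2)
$$

Here $\sigma(\cdot)$ and $\tanh(\cdot)$ are the element-wise sigmoid and
hyperbolic tangent functions, $\odot$ is the element-wise multiplication
operator, and $\mathbf{i}_t$, $\mathbf{f}_t$, $\mathbf{o}_t$ are
referred to as input, forget, and output gates. At $t = 1$,
$\mathbf{h}_0$ and $\mathbf{c}_0$ are initialized to zero vectors.
Parameters of the LSTM are $\mathbf{W}^j$, $\mathbf{U}^j$,
$\mathbf{b}^j$ for $j \in \{i, f, o, g\}$.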
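The six calculations of Eq. (2) translate almost line for line into code. The sketch below is illustrative only: the dictionary layout of the parameters, the toy dimensions, and the use of a common dimension $m$ for both the hidden state and the memory cell are assumptions made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following Eq. (2).

    params maps each name j in {"i", "f", "o", "g"} to a tuple
    (W_j, U_j, b_j); this layout is an assumption of the sketch.
    """
    pre = {j: W @ x_t + U @ h_prev + b for j, (W, U, b) in params.items()}
    i_t = sigmoid(pre["i"])           # input gate
    f_t = sigmoid(pre["f"])           # forget gate
    o_t = sigmoid(pre["o"])           # output gate
    g_t = np.tanh(pre["g"])           # candidate update
    c_t = f_t * c_prev + i_t * g_t    # additive memory-cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n, m = 3, 2                           # hypothetical input/hidden sizes
params = {j: (rng.normal(size=(m, n)), rng.normal(size=(m, m)), np.zeros(m))
          for j in "ifog"}
h, c = np.zeros(m), np.zeros(m)       # h_0 and c_0 are zero vectors
for x_t in rng.normal(size=(5, n)):
    h, c = lstm_step(x_t, h, c, params)
```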

Memory cells in the LSTM are additive with respect to time, alleviating
the gradient vanishing problem. Gradient exploding is still an issue,
though in practice simple optimization strategies (such as gradient
clipping) work well. LSTMs have been shown to outperform vanilla RNNs on
many tasks, including on language modeling (Sundermeyer, Schluter, and
Ney 2012). It is easy to extend the RNN/LSTM to two (or more) layers by
having another network whose input at $t$ is $\mathbf{h}_t$ (from the
first network). Indeed, having multiple layers is often crucial for
obtaining competitive performance on various tasks (Pascanu et al.
2013).

Figure 1: Architecture of our language model applied to an example
sentence. Best viewed in color. Here the model takes absurdity as the
current input and combines it with the history (as represented by the
hidden state) to predict the next word, is. First layer performs a
lookup of character embeddings (of dimension four) and stacks them to
form the matrix $\mathbf{C}^k$. Then convolution operations are applied
between $\mathbf{C}^k$ and multiple filter matrices. Note that in the
above example we have twelve filters: three filters of width two (blue),
four filters of width three (yellow), and five filters of width four
(red). A max-over-time pooling operation is applied to obtain a
fixed-dimensional representation of the word, which is given to the
highway network. The highway network's output is used as the input to a
multi-layer LSTM. Finally, an affine transformation followed by a
softmax is applied over the hidden representation of the LSTM to obtain
the distribution over the next word. Cross entropy loss between the
(predicted) distribution over next word and the actual next word is
minimized. Element-wise addition, multiplication, and sigmoid operators
are depicted in circles, and affine transformations (plus nonlinearities
where appropriate) are represented by solid arrows.

Recurrent Neural Network Language Model 

Let $\mathcal{V}$ be the fixed size vocabulary of words. A language
model specifies a distribution over $w_{t+1}$ (whose support is
$\mathcal{V}$) given the historical sequence
$w_{1:t} = [w_1, \ldots, w_t]$. A recurrent neural network language
model (RNN-LM) does this by applying an affine transformation to the
hidden layer followed by a softmax:

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}^j + q^j)}{\sum_{j' \in \mathcal{V}} \exp(\mathbf{h}_t \cdot \mathbf{p}^{j'} + q^{j'})} \qquad (3)$$

where $\mathbf{p}^j$ is the $j$-th column of
$\mathbf{P} \in \mathbb{R}^{m \times |\mathcal{V}|}$ (also referred to
as the output embedding),² and $q^j$ is a bias term. Similarly, for a
conventional RNN-LM which usually takes words as inputs, if $w_t = k$,
then the input to the RNN-LM at $t$ is the input embedding
$\mathbf{x}^k$, the $k$-th column of the embedding matrix
$\mathbf{X} \in \mathbb{R}^{n \times |\mathcal{V}|}$. Our model simply
replaces the input embeddings $\mathbf{X}$ with the output from a
character-level convolutional neural network, to be described below.
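The output layer of Eq. (3) can be sketched as follows. The function name and the toy sizes are hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the distribution unchanged and is not something the text prescribes.

```python
import numpy as np

def next_word_distribution(h_t, P, q):
    """Eq. (3): softmax over the scores h_t . p^j + q^j for every word j."""
    scores = P.T @ h_t + q        # P has shape (m, |V|), q has shape (|V|,)
    scores -= scores.max()        # stabilizes exp without changing the result
    probs = np.exp(scores)
    return probs / probs.sum()

# With all-zero output embeddings and biases every word is equally likely.
m, vocab_size = 4, 6
probs = next_word_distribution(np.ones(m), np.zeros((m, vocab_size)),
                               np.zeros(vocab_size))
```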

If we denote $w_{1:T} = [w_1, \ldots, w_T]$ to be the sequence of words
in the training corpus, training involves minimizing the negative
log-likelihood (NLL) of the sequence

$$NLL = -\sum_{t=1}^{T} \log \Pr(w_t \mid w_{1:t-1}) \qquad (4)$$

which is typically done by truncated backpropagation through time
(Werbos 1990; Graves 2013).

Character-level Convolutional Neural Network

In our model, the input at time $t$ is an output from a character-level
convolutional neural network (CharCNN), which we describe in this
section. CNNs (LeCun et al. 1989) have achieved state-of-the-art results
on computer vision (Krizhevsky, Sutskever, and Hinton 2012) and have
also been shown to be effective for various NLP tasks (Collobert et al.
2011). Architectures employed for NLP applications differ in that they
typically involve temporal rather than spatial convolutions.

Let $\mathcal{C}$ be the vocabulary of characters, $d$ be the
dimensionality of character embeddings,³ and
$\mathbf{Q} \in \mathbb{R}^{d \times |\mathcal{C}|}$ be the matrix of
character embeddings. Suppose that word $k \in \mathcal{V}$ is made up
of a sequence of characters $[c_1, \ldots, c_l]$, where $l$ is the
length of word $k$. Then the character-level representation of $k$ is
given by the matrix $\mathbf{C}^k \in \mathbb{R}^{d \times l}$, where
the $j$-th column corresponds to the character embedding for $c_j$ (i.e.
the $c_j$-th column of $\mathbf{Q}$).⁴

We apply a narrow convolution between $\mathbf{C}^k$ and a filter (or
kernel) $\mathbf{H} \in \mathbb{R}^{d \times w}$ of width $w$, after
which we add a bias and apply a nonlinearity to obtain a feature map
$\mathbf{f}^k \in \mathbb{R}^{l-w+1}$. Specifically, the $i$-th element
of $\mathbf{f}^k$ is given by:

$$\mathbf{f}^k[i] = \tanh(\langle \mathbf{C}^k[*, i:i+w-1], \mathbf{H} \rangle + b) \qquad (5)$$

where $\mathbf{C}^k[*, i:i+w-1]$ is the $i$-to-$(i+w-1)$-th column of
$\mathbf{C}^k$ and
$\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}\mathbf{B}^T)$
is the Frobenius inner product. Finally, we take the max-over-time

$$y^k = \max_i \mathbf{f}^k[i] \qquad (6)$$

as the feature corresponding to the filter $\mathbf{H}$ (when applied to
word $k$). The idea is to capture the most important feature (the one
with the highest value) for a given filter. A filter is essentially
picking out a character n-gram, where the size of the n-gram corresponds
to the filter width.

We have described the process by which one feature is obtained from one
filter matrix. Our CharCNN uses multiple filters of varying widths to
obtain the feature vector for $k$. So if we have a total of $h$ filters
$\mathbf{H}_1, \ldots, \mathbf{H}_h$, then
$\mathbf{y}^k = [y^k_1, \ldots, y^k_h]$ is the input representation of
$k$. For many NLP applications $h$ is typically chosen to be in
[100, 1000].

² In our work, predictions are at the word-level, and hence we still
utilize word embeddings in the output layer.

³ Given that $|\mathcal{C}|$ is usually small, some authors work with
one-hot representations of characters. However we found that using lower
dimensional representations of characters (i.e. $d < |\mathcal{C}|$)
performed slightly better.

⁴ Two technical details warrant mention here: (1) we append
start-of-word and end-of-word characters to each word to better
represent prefixes and suffixes and hence $\mathbf{C}^k$ actually has
$l + 2$ columns; (2) for batch processing, we zero-pad $\mathbf{C}^k$ so
that the number of columns is constant (equal to the max word length)
for all words in $\mathcal{V}$.
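The CharCNN computation (character lookup, narrow convolution, tanh, max-over-time) can be sketched as below. This is a toy illustration rather than the released implementation: the character inventory, the use of "{" and "}" as start-of-word and end-of-word markers, the random weights, and the filter configuration (which mirrors the twelve filters of Figure 1) are all assumptions, and zero-padding for batching is omitted.

```python
import numpy as np

def char_cnn_feature(C_k, H, b):
    """One filter: feature map f^k followed by max-over-time pooling."""
    l, w = C_k.shape[1], H.shape[1]
    f_k = np.array([
        # Frobenius inner product <C^k[:, i:i+w], H> plus bias, then tanh.
        np.tanh(np.sum(C_k[:, i:i + w] * H) + b)
        for i in range(l - w + 1)
    ])
    return f_k.max()

def char_cnn(word, char_to_idx, Q, filters):
    """Stack character embeddings into C^k and apply every filter."""
    C_k = Q[:, [char_to_idx[ch] for ch in word]]
    return np.array([char_cnn_feature(C_k, H, b) for H, b in filters])

chars = "{}abcdefghijklmnopqrstuvwxyz"     # "{" and "}" are assumed markers
char_to_idx = {ch: i for i, ch in enumerate(chars)}
rng = np.random.default_rng(0)
d = 4                                      # character embedding size
Q = rng.normal(size=(d, len(chars)))
widths = [2] * 3 + [3] * 4 + [4] * 5       # twelve filters, as in Figure 1
filters = [(rng.normal(size=(d, w)), 0.0) for w in widths]
y_k = char_cnn("{absurdity}", char_to_idx, Q, filters)
```

The result `y_k` has one entry per filter regardless of word length, which is what makes it usable as a fixed-dimensional input to the layers that follow.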


Highway Network 

We could simply replace $\mathbf{x}^k$ (the word embedding) with
$\mathbf{y}^k$ at each $t$ in the RNN-LM, and as we show later, this
simple model performs well on its own (Table 7). One could also have a
multilayer perceptron (MLP) over $\mathbf{y}^k$ to model interactions
between the character n-grams picked up by the filters, but we found
that this resulted in worse performance.

Instead we obtained improvements by running $\mathbf{y}^k$ through a
highway network, recently proposed by Srivastava et al. (2015). Whereas
one layer of an MLP applies an affine transformation followed by a
nonlinearity to obtain a new set of features,

$$\mathbf{z} = g(\mathbf{W}\mathbf{y} + \mathbf{b}) \qquad (7)$$

one layer of a highway network does the following:

$$\mathbf{z} = \mathbf{t} \odot g(\mathbf{W}_H\mathbf{y} + \mathbf{b}_H) + (\mathbf{1} - \mathbf{t}) \odot \mathbf{y} \qquad (8)$$

where $g$ is a nonlinearity,
$\mathbf{t} = \sigma(\mathbf{W}_T\mathbf{y} + \mathbf{b}_T)$ is called
the transform gate, and $(\mathbf{1} - \mathbf{t})$ is called the carry
gate. Similar to the memory cells in LSTM networks, highway layers allow
for training of deep networks by adaptively carrying some dimensions of
the input directly to the output.⁵ By construction the dimensions of
$\mathbf{y}$ and $\mathbf{z}$ have to match, and hence $\mathbf{W}_T$
and $\mathbf{W}_H$ are square matrices.
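A single highway layer, Eq. (8), is short enough to sketch directly. ReLU is used for $g$ as in Table 2; the function names and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def highway_layer(y, W_H, b_H, W_T, b_T, g=relu):
    """Eq. (8): z = t * g(W_H y + b_H) + (1 - t) * y."""
    t = sigmoid(W_T @ y + b_T)                    # transform gate
    return t * g(W_H @ y + b_H) + (1.0 - t) * y   # (1 - t) is the carry gate

y = np.array([0.5, -1.0, 2.0])
eye, zeros = np.eye(3), np.zeros((3, 3))
# A strongly negative gate bias makes the layer carry its input through.
z_carry = highway_layer(y, eye, np.zeros(3), zeros, np.full(3, -50.0))
# A strongly positive gate bias makes it behave like the plain MLP layer.
z_transform = highway_layer(y, eye, np.zeros(3), zeros, np.full(3, 50.0))
```

The two extreme settings show why a negative initial gate bias starts training close to the identity mapping.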

Experimental Setup 

As is standard in language modeling, we use perplexity (PPL) to evaluate
the performance of our models. Perplexity of a model over a sequence
$[w_1, \ldots, w_T]$ is given by

$$PPL = \exp\left(\frac{NLL}{T}\right) \qquad (9)$$

where NLL is calculated over the test set. We test the model on corpora
of varying languages and sizes (statistics available in Table 1).
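Perplexity is the exponential of the average per-token negative log-likelihood, which the following sketch computes from a list of predicted probabilities. The function name and the example probabilities are hypothetical.

```python
import math

def perplexity(word_probs):
    """exp(NLL / T), where word_probs[t] is the probability the model
    assigned to the word that actually occurred at position t."""
    nll = -sum(math.log(p) for p in word_probs)
    return math.exp(nll / len(word_probs))

# A model that always spreads its mass evenly over 4 candidates is
# exactly as uncertain as a uniform 4-way choice.
ppl = perplexity([0.25] * 8)
```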

We conduct hyperparameter search, model introspection, and ablation
studies on the English Penn Treebank (PTB) (Marcus, Santorini, and
Marcinkiewicz 1993), utilizing the standard training (0-20), validation
(21-22), and test (23-24) splits along with pre-processing by Mikolov et
al. (2010). With approximately 1m tokens and $|\mathcal{V}|$ = 10k, this
version has been extensively used by the language modeling community and
is publicly available.⁶

⁵ Srivastava et al. (2015) recommend initializing $\mathbf{b}_T$ to a
negative value, in order to militate the initial behavior towards carry.
We initialized $\mathbf{b}_T$ to a small interval around $-2$.

                         Data-S                     Data-L
                  |V|      |C|     T         |V|      |C|     T
  English (En)    10 k     51      1 m       60 k     197     20 m
  Czech (Cs)      46 k     101     1 m       206 k    195     17 m
  German (De)     37 k     74      1 m       339 k    260     51 m
  Spanish (Es)    27 k     72      1 m       152 k    222     56 m
  French (Fr)     25 k     76      1 m       137 k    225     57 m
  Russian (Ru)    62 k     62      1 m       497 k    111     25 m
  Arabic (Ar)     86 k     132     4 m       -        -       -

Table 1: Corpus statistics. |V| = word vocabulary size; |C| = character
vocabulary size; T = number of tokens in training set. The small English
data is from the Penn Treebank and the Arabic data is from the
News-Commentary corpus. The rest are from the 2013 ACL Workshop on
Machine Translation. |C| is large because of (rarely occurring) special
characters.

                   Small              Large
  CNN       d      15                 15
            w      [1,2,3,4,5,6]      [1,2,3,4,5,6,7]
            h      [25 · w]           [min{200, 50 · w}]
            f      tanh               tanh
  Highway   l      1                  2
            g      ReLU               ReLU
  LSTM      l      2                  2
            m      300                650

Table 2: Architecture of the small and large models. d = dimensionality
of character embeddings; w = filter widths; h = number of filter
matrices, as a function of filter width (so the large model has filters
of width [1, 2, 3, 4, 5, 6, 7] of size [50, 100, 150, 200, 200, 200,
200] for a total of 1100 filters); f, g = nonlinearity functions; l =
number of layers; m = number of hidden units.
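The filter counts implied by Table 2 can be checked with a few lines of arithmetic. The helper name below is hypothetical; the widths, multipliers, and cap are the values given in the table.

```python
def filters_per_width(widths, per_width, cap=None):
    """Number of filters for each width: per_width * w, optionally capped."""
    counts = [per_width * w for w in widths]
    if cap is not None:
        counts = [min(cap, c) for c in counts]
    return counts

small = filters_per_width(range(1, 7), 25)            # h = 25 * w
large = filters_per_width(range(1, 8), 50, cap=200)   # h = min(200, 50 * w)
print(small, sum(small))   # [25, 50, 75, 100, 125, 150] 525
print(large, sum(large))   # [50, 100, 150, 200, 200, 200, 200] 1100
```

The totals (525 and 1100) are also the dimensionalities of the word representation that the CharCNN passes on in the small and large models.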

With the optimal hyperparameters tuned on PTB, we apply the model to
various morphologically rich languages: Czech, German, French, Spanish,
Russian, and Arabic. Non-Arabic data comes from the 2013 ACL Workshop on
Machine Translation,⁷ and we use the same train/validation/test splits
as in Botha and Blunsom (2014). While the raw data are publicly
available, we obtained the preprocessed versions from the authors,⁸
whose morphological NLM serves as a baseline for our work. We train on
both the small datasets (Data-S) with 1m tokens per language, and the
large datasets (Data-L) including the large English data which has a
much bigger $|\mathcal{V}|$ than the PTB. Arabic data comes from the
News-Commentary corpus,⁹ and we perform our own preprocessing and
train/validation/test splits.

In these datasets only singleton words were replaced with <unk> and
hence we effectively use the full vocabulary. It is worth noting that
the character model can utilize surface forms of OOV tokens (which were
replaced with <unk>), but we do not do this and stick to the
preprocessed versions (despite disadvantaging the character models) for
exact comparison against prior work.

⁶ http://www.fit.vutbr.cz/~imikolov/rnnlm/
⁷ http://www.statmt.org/wmt13/translation-task.html
⁸ http://bothameister.github.io/
⁹ http://opus.lingfil.uu.se/News-Commentary.php

Optimization

The models are trained by truncated backpropagation through time (Werbos
1990; Graves 2013). We backpropagate for 35 time steps using stochastic
gradient descent where the learning rate is initially set to 1.0 and
halved if the perplexity does not decrease by more than 1.0 on the
validation set after an epoch. On Data-S we use a batch size of 20 and
on Data-L we use a batch size of 100 (for greater efficiency). Gradients
are averaged over each batch. We train for 25 epochs on non-Arabic and
30 epochs on Arabic data (which was sufficient for convergence), picking
the best performing model on the validation set. Parameters of the model
are randomly initialized over a uniform distribution with support
[-0.05, 0.05].
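The learning-rate rule described above (start at 1.0, halve whenever validation perplexity fails to improve by more than 1.0) can be sketched as follows. Comparing each epoch against the immediately preceding epoch, rather than against the best perplexity so far, is an assumption of this sketch, and the perplexity history is made up.

```python
def learning_rate_schedule(val_ppls, lr=1.0, min_gain=1.0):
    """Return the learning rate used in each epoch, given the validation
    perplexity measured at the end of each epoch."""
    rates, prev = [], None
    for ppl in val_ppls:
        rates.append(lr)
        # Halve for the next epoch if the gain was not more than min_gain.
        if prev is not None and prev - ppl <= min_gain:
            lr /= 2.0
        prev = ppl
    return rates

# Hypothetical validation perplexities over five epochs.
rates = learning_rate_schedule([200.0, 150.0, 149.5, 140.0, 140.2])
```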

For regularization we use dropout (Hinton et al. 2012) with probability
0.5 on the LSTM input-to-hidden layers (except on the initial Highway to
LSTM layer) and the hidden-to-output softmax layer. We further constrain
the norm of the gradients to be below 5, so that if the $L_2$ norm of
the gradient exceeds 5 then we renormalize it to have $\|\cdot\| = 5$
before updating. The gradient norm constraint was crucial in training
the model. These choices were largely guided by previous work of Zaremba
et al. (2014) on word-level language modeling with LSTMs.
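The gradient norm constraint amounts to a rescaling whenever the norm exceeds the threshold. In the sketch below the norm is taken jointly over all parameter gradients, which is an assumption; the text does not say whether the constraint is applied globally or per parameter.

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at
    most max_norm; gradients already within the limit are untouched."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # joint norm is 13
clipped = clip_gradient(grads)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
```

Rescaling preserves the direction of the update and only shortens the step.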

                                          PPL      Size
  LSTM-Word-Small                         97.6     5 m
  LSTM-Char-Small                         92.3     5 m
  LSTM-Word-Large                         85.4     20 m
  LSTM-Char-Large                         78.9     19 m
  KN-5 (Mikolov et al. 2012)              141.2    2 m
  RNN† (Mikolov et al. 2012)              124.7    6 m
  RNN-LDA† (Mikolov et al. 2012)          113.7    7 m
  genCNN† (Wang et al. 2015)              116.4    8 m
  FOFE-FNNLM† (Zhang et al. 2015)         108.0    6 m
  Deep RNN (Pascanu et al. 2013)          107.5    6 m
  Sum-Prod Net† (Cheng et al. 2014)       100.0    5 m
  LSTM-1† (Zaremba et al. 2014)           82.7     20 m
  LSTM-2† (Zaremba et al. 2014)           78.4     52 m

Table 3: Performance of our model versus other neural language models on
the English Penn Treebank test set. PPL refers to perplexity (lower is
better) and size refers to the approximate number of parameters in the
model. KN-5 is a Kneser-Ney 5-gram language model which serves as a
non-neural baseline. † For these models the authors did not explicitly
state the number of parameters, and hence sizes shown here are estimates
based on our understanding of their papers or private correspondence
with the respective authors.

Finally, in order to speed up training on Data-L we employ a
hierarchical softmax (Morin and Bengio 2005), a common strategy for
training language models with very large $|\mathcal{V}|$, instead of
the usual softmax. We pick the number of clusters
$c = \lceil \sqrt{|\mathcal{V}|} \rceil$ and randomly split
$\mathcal{V}$ into mutually exclusive and collectively exhaustive
subsets $\mathcal{V}_1, \ldots, \mathcal{V}_c$ of (approximately) equal
size.¹⁰ Then $\Pr(w_{t+1} = j \mid w_{1:t})$ becomes,

$$\Pr(w_{t+1} = j \mid w_{1:t}) = \frac{\exp(\mathbf{h}_t \cdot \mathbf{s}^r + t^r)}{\sum_{r'=1}^{c} \exp(\mathbf{h}_t \cdot \mathbf{s}^{r'} + t^{r'})} \times \frac{\exp(\mathbf{h}_t \cdot \mathbf{p}_r^j + q_r^j)}{\sum_{j' \in \mathcal{V}_r} \exp(\mathbf{h}_t \cdot \mathbf{p}_r^{j'} + q_r^{j'})} \qquad (10)$$

where $r$ is the cluster index such that $j \in \mathcal{V}_r$. The
first term is simply the probability of picking cluster $r$, and the
second term is the probability of picking word $j$ given that cluster
$r$ is picked. We found that hierarchical softmax was not necessary for
models trained on Data-S.

¹⁰ While Brown clustering/frequency-based clustering is commonly used in
the literature (e.g. Botha and Blunsom (2014) use Brown clustering), we
used random clusters as our implementation enjoys the best speed-up when
the number of words in each cluster is approximately equal. We found
random clustering to work surprisingly well.
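The two-level factorization of the hierarchical softmax (probability of the cluster times probability of the word within the cluster) can be sketched as below. The parameter names, the toy vocabulary of nine words, and the random weights are illustrative assumptions; only the $\lceil \sqrt{|\mathcal{V}|} \rceil$ cluster count and the random, roughly equal-sized split follow the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hierarchical_prob(h_t, j, clusters, S, t_bias, P, q):
    """Pr(cluster r | h_t) * Pr(word j | cluster r, h_t).

    clusters: list of lists of word indices forming a partition of V
    S: (m, c) cluster embeddings; t_bias: (c,) cluster biases
    P: (m, |V|) word output embeddings; q: (|V|,) word biases
    """
    r = next(i for i, members in enumerate(clusters) if j in members)
    members = clusters[r]
    p_cluster = softmax(S.T @ h_t + t_bias)[r]
    p_word = softmax(P[:, members].T @ h_t + q[members])[members.index(j)]
    return p_cluster * p_word

rng = np.random.default_rng(0)
m, vocab_size = 4, 9
c = int(np.ceil(np.sqrt(vocab_size)))                # three clusters
perm = rng.permutation(vocab_size)                   # random assignment
clusters = [perm[i::c].tolist() for i in range(c)]   # equal-sized subsets
S, t_bias = rng.normal(size=(m, c)), np.zeros(c)
P, q = rng.normal(size=(m, vocab_size)), np.zeros(vocab_size)
h_t = rng.normal(size=m)
total = sum(hierarchical_prob(h_t, j, clusters, S, t_bias, P, q)
            for j in range(vocab_size))
```

Each probability needs only two small normalizations (over $c$ clusters and over the words of one cluster) instead of one over the whole vocabulary, and the probabilities still sum to one over all words.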

Results 

English Penn Treebank 

We train two versions of our model to assess the trade-off between
performance and size. Architecture of the small (LSTM-Char-Small) and
large (LSTM-Char-Large) models is summarized in Table 2. As another
baseline, we also train two comparable LSTM models that use word
embeddings only (LSTM-Word-Small, LSTM-Word-Large). LSTM-Word-Small uses
200 hidden units and LSTM-Word-Large uses 650 hidden units. Word
embedding sizes are also 200 and 650 respectively. These were chosen to
keep the number of parameters similar to the corresponding
character-level model.

As can be seen from Table 3, our large model is on par with the existing
state-of-the-art (Zaremba et al. 2014), despite having approximately 60%
fewer parameters. Our small model significantly outperforms other NLMs
of similar size, even though it is penalized by the fact that the
dataset already has OOV words replaced with <unk> (other models are
purely word-level models). While lower perplexities have been reported
with model ensembles (Mikolov and Zweig 2012), we do not include them
here as they are not comparable to the current work.

Other Languages 

The model's performance on the English PTB is informative to the extent
that it facilitates comparison against the large body of existing work.
However, English is relatively simple from a morphological standpoint,
and thus our next set of results (and arguably the main contribution of
this paper) is focused on languages with richer morphology (Table 4,
Table 5).

                           Data-S
                    Cs     De     Es     Fr     Ru     Ar
  Botha   KN-4      545    366    241    274    396    323
          MLBL      465    296    200    225    304    -
  Small   Word      503    305    212    229    352    216
          Morph     414    278    197    216    290    230
          Char      401    260    182    189    278    196
  Large   Word      493    286    200    222    357    172
          Morph     398    263    177    196    271    148
          Char      371    239    165    184    261    148

Table 4: Test set perplexities for Data-S. First two rows are from Botha
(2014) (except on Arabic where we trained our own KN-4 model) while the
last six are from this paper. KN-4 is a Kneser-Ney 4-gram language
model, and MLBL is the best performing morphological log-bilinear model
from Botha (2014). Small/Large refer to model size (see Table 2), and
Word/Morph/Char are models with words/morphemes/characters as inputs
respectively.

We compare our results against the morphological log-bilinear (MLBL)
model from Botha and Blunsom (2014), whose model also takes into account
subword information through morpheme embeddings that are summed at the
input and output layers. As comparison against the MLBL models is
confounded by our use of LSTMs (widely known to outperform their
feed-forward/log-bilinear cousins), we also train an LSTM version of the
morphological NLM, where the input representation of a word given to the
LSTM is a summation of the word's morpheme embeddings. Concretely,
suppose that $\mathcal{M}$ is the set of morphemes in a language,
$\mathbf{M} \in \mathbb{R}^{n \times |\mathcal{M}|}$ is the matrix of
morpheme embeddings, and $\mathbf{m}^j$ is the $j$-th column of
$\mathbf{M}$ (i.e. a morpheme embedding). Given the input word $k$, we
feed the following representation to the LSTM:

$$\mathbf{x}^k + \sum_{j \in \mathcal{M}_k} \mathbf{m}^j \qquad (11)$$

where $\mathbf{x}^k$ is the word embedding (as in a word-level model)
and $\mathcal{M}_k \subset \mathcal{M}$ is the set of morphemes for word
$k$. The morphemes are obtained by running an unsupervised morphological
tagger as a preprocessing step.¹¹ We emphasize that the word embedding
itself (i.e. $\mathbf{x}^k$) is added on top of the morpheme embeddings,
as was done in Botha and Blunsom (2014). The morpheme embeddings are of
size 200/650 for the small/large models respectively. We further train
word-level LSTM models as another baseline.

On Data-S it is clear from Table 4 that the character-level models
outperform their word-level counterparts despite, again, being
smaller.¹² The character models also outperform their morphological
counterparts (both MLBL and LSTM architectures), although improvements
over the morphological LSTMs are more measured. Note that the morpheme
models have strictly more parameters than the word models because word
embeddings are used as part of the input.

¹¹ We use Morfessor Cat-MAP (Creutz and Lagus 2007), as in Botha and
Blunsom (2014).

                           Data-L
                    Cs     De     Es     Fr     Ru     En
  Botha   KN-4      862    463    219    243    390    291
          MLBL      643    404    203    227    300    273
  Small   Word      701    347    186    202    353    236
          Morph     615    331    189    209    331    233
          Char      578    305    169    190    313    216

Table 5: Test set perplexities on Data-L. First two rows are from Botha
(2014), while the last three rows are from the small LSTM models
described in the paper. KN-4 is a Kneser-Ney 4-gram language model, and
MLBL is the best performing morphological log-bilinear model from Botha
(2014). Word/Morph/Char are models with words/morphemes/characters as
inputs respectively.

Due to memory constraints¹³ we only train the small models on Data-L
(Table 5). Interestingly we do not observe significant differences going
from word to morpheme LSTMs on Spanish, French, and English. The
character models again outperform the word/morpheme models. We also
observe significant perplexity reductions even on English when
$\mathcal{V}$ is large. We conclude this section by noting that we used
the same architecture for all languages and did not perform any
language-specific tuning of hyperparameters.


Discussion 
Learned Word Representations 

We explore the word representations learned by the models on the PTB.
Table 6 has the nearest neighbors of word representations learned from
both the word-level and character-level models. For the character models
we compare the representations obtained before and after highway layers.

Before the highway layers the representations seem to solely rely on
surface forms. For example, the nearest neighbors of you are your,
young, four, youth, which are close to you in terms of edit distance.
The highway layers, however, seem to enable encoding of semantic
features that are not discernible from orthography alone. After highway
layers the nearest neighbor of you is we, which is orthographically
distinct from you. Another example is while and though: these words are
far apart edit distance-wise yet the composition model is able to place
them near each other. The model also makes some clear mistakes (e.g. his
and hhs), highlighting the limits of our approach, although this could
be due to the small dataset.

¹² The difference in parameters is greater for non-PTB corpora as the
size of the word model scales faster with $|\mathcal{V}|$. For example,
on Arabic the small/large word models have 35m/121m parameters while the
corresponding character models have 29m/69m parameters respectively.

¹³ All models were trained on GPUs with 2GB memory.

Figure 2: Plot of character n-gram representations via PCA for English.
Colors correspond to: prefixes (red), suffixes (blue), hyphenated
(orange), and all others (grey). Prefixes refer to character n-grams
which start with the start-of-word character. Suffixes likewise refer to
character n-grams which end with the end-of-word character.

The learned representations of OOV words (computer-aided, misinformed)
are positioned near words with the same part-of-speech. The model is
also able to correct for incorrect/non-standard spelling (looooook),
indicating potential applications for text normalization in noisy
domains.
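Nearest neighbors of the kind discussed here are obtained by ranking words by cosine similarity to a query representation. The sketch below uses made-up two-dimensional vectors purely for illustration; real representations would come from the trained model.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=4):
    """Return the k words whose vectors have the highest cosine
    similarity to the vector of `query` (the query itself is excluded)."""
    q = vectors[query]

    def cosine(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    others = [w for w in vectors if w != query]
    return sorted(others, key=lambda w: cosine(vectors[w]), reverse=True)[:k]

# Hypothetical toy representations, not values from the paper.
vectors = {
    "you": np.array([1.0, 0.0]),
    "we": np.array([0.9, 0.1]),
    "your": np.array([0.5, 0.5]),
    "chile": np.array([0.0, 1.0]),
}
neighbors = nearest_neighbors("you", vectors, k=2)
```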

Learned Character N-gram Representations

As discussed previously, each filter of the CharCNN is essentially
learning to detect particular character n-grams. Our initial expectation
was that each filter would learn to activate on different morphemes and
then build up semantic representations of words from the identified
morphemes. However, upon reviewing the character n-grams picked up by
the filters (i.e. those that maximized the value of the filter), we
found that they did not (in general) correspond to valid morphemes.

To get a better intuition for what the character composition model is
learning, we plot the learned representations of all character n-grams
(that occurred as part of at least two words in $\mathcal{V}$) via
principal components analysis (Figure 2). We feed each character n-gram
into the CharCNN and use the CharCNN's output as the fixed dimensional
representation for the corresponding character n-gram. As is apparent
from Figure 2, the model learns to differentiate between prefixes (red),
suffixes (blue), and others (grey). We also find that the
representations are particularly sensitive to character n-grams
containing hyphens (orange), presumably because this is a strong signal
of a word's part-of-speech.
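A two-dimensional PCA projection of the kind used for such a plot can be sketched with a singular value decomposition. The input here is random data standing in for CharCNN outputs, so the shapes and values are assumptions; plotting and coloring by n-gram type are left out.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                       # center each dimension
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # coordinates in the top-2 basis

# Stand-in for CharCNN outputs: 50 n-grams with 6-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Z = pca_2d(X)
```

Each row of `Z` gives the plotting coordinates of one character n-gram.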

Highway Layers 

We quantitatively investigate the effect of highway network layers via
ablation studies (Table 7). We train a model without any highway layers,
and find that performance decreases significantly. As the difference in
performance could be due to the decrease in model size, we also train a
model that feeds $\mathbf{y}^k$ (i.e. word representation from the
CharCNN) through a one-layer multilayer perceptron (MLP) to use as input
into the LSTM. We find that the MLP does poorly, although this could be
due to optimization issues.

                                      In Vocabulary                                  Out-of-Vocabulary
                     while         his     you            richard   trading      computer-aided    misinformed   looooook
  LSTM-Word          although      your    conservatives  jonathan  advertised   -                 -             -
                     letting       her     we             robert    advertising  -                 -             -
                     though        my      guys           neil      turnover     -                 -             -
                     minute        their   i              nancy     turnover     -                 -             -
  LSTM-Char          chile         this    your           hard      heading      computer-guided   informed      look
  (before highway)   whole         hhs     young          rich      training     computerized      performed     cook
                     meanwhile     is      four           richer    reading      disk-drive        transformed   looks
                     white         has     youth          richter   leading      computer          inform        shook
  LSTM-Char          meanwhile     hhs     we             eduard    trade        computer-guided   informed      look
  (after highway)    whole         this    your           gerard    training     computer-driven   performed     looks
                     though        their   doug           edward    traded       computerized      outperformed  looked
                     nevertheless  your    i              carl      trader       computer          transformed   looking

Table 6: Nearest neighbor words (based on cosine similarity) of word
representations from the large word-level and character-level (before
and after highway layers) models trained on the PTB. Last three words
are OOV words, and therefore they do not have representations in the
word-level model.

                          LSTM-Char
                        Small    Large
  No Highway Layers     100.3    84.6
  One Highway Layer     92.3     79.7
  Two Highway Layers    90.1     78.9
  One MLP Layer         111.2    92.6

Table 7: Perplexity on the Penn Treebank for small/large models trained
with/without highway layers.

We hypothesize that highway networks are especially
well-suited to work with CNNs, adaptively combining local
features detected by the individual filters. CNNs have
already proven successful for many NLP tasks
(Collobert et al. 2011; Shen et al. 2014; Kalchbrenner,
Grefenstette, and Blunsom 2014; Kim 2014; Zhang, Zhao,
and LeCun 2015; Lei, Barzilay, and Jaakola 2015), and we
posit that further gains could be achieved by employing
highway layers on top of existing CNN architectures.
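
To make the comparison between a highway layer and a plain MLP layer concrete, the following is a minimal sketch, assuming the standard highway formulation z = t * g(W_H y + b_H) + (1 - t) * y with transform gate t = sigmoid(W_T y + b_T). The choice of ReLU for g, the plain-list vectors, and the function names are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def affine(W, b, y):
    """Return W y + b for a list-of-rows matrix W and vectors b, y."""
    return [sum(w * y_j for w, y_j in zip(row, y)) + b_i
            for row, b_i in zip(W, b)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_layer(y, W_H, b_H):
    """One MLP layer: z = g(W_H y + b_H), with g assumed to be ReLU."""
    return [max(0.0, a) for a in affine(W_H, b_H, y)]


def highway_layer(y, W_H, b_H, W_T, b_T):
    """Highway layer: z = t * g(W_H y + b_H) + (1 - t) * y."""
    t = [sigmoid(a) for a in affine(W_T, b_T, y)]  # transform gate
    h = mlp_layer(y, W_H, b_H)                     # transformed features
    # (1 - t) is the carry gate: it copies input dimensions through unchanged.
    return [t_i * h_i + (1.0 - t_i) * y_i for t_i, h_i, y_i in zip(t, h, y)]


# Toy 3-dimensional "CharCNN output" and parameters (illustrative values only).
y = [0.5, -1.0, 2.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]

carried = highway_layer(y, identity, [0.0] * 3, zeros, [-20.0] * 3)
transformed = highway_layer(y, identity, [0.0] * 3, zeros, [20.0] * 3)
```

With a strongly negative gate bias the layer carries its input through almost unchanged, and with a strongly positive one it behaves like the MLP layer. The per-dimension gate lets the layer choose between these two behaviours separately for each feature.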

We also anecdotally note that (1) having one to two highway
layers was important, but more highway layers generally
resulted in similar performance (though this may depend
on the size of the datasets), (2) having more convolutional
layers before max-pooling did not help, and (3) highway
layers did not improve models that only used word
embeddings as inputs.

Effect of Corpus/Vocab Sizes

We next study the effect of training corpus/vocabulary sizes
on the relative performance between the different models.
We take the German (De) dataset from Data-L and vary the
training corpus/vocabulary sizes, calculating the perplexity
reductions as a result of going from a small word-level
model to a small character-level model. To vary the vocabulary
size we take the most frequent k words and replace the
rest with <unk>. As with previous experiments, the character
model does not utilize surface forms of <unk> and simply
treats it as another token. Although Table 8 suggests that the
perplexity reductions become less pronounced as the corpus
size increases, we nonetheless find that the character-level
model outperforms the word-level model in all scenarios.

Table 8: Perplexity reductions by going from small word-level to
character-level models based on different corpus/vocabulary sizes
on German (De). |V| is the vocabulary size and T is the number
of tokens in the training set. The full vocabulary of the 1m dataset
was less than 100k and hence that scenario is unavailable.
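
The vocabulary truncation used in this experiment can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the sketch assumes that k counts word types excluding <unk>, and it assumes that the reported reduction is the relative one, (word-level perplexity - character-level perplexity) / word-level perplexity.

```python
from collections import Counter

UNK = "<unk>"


def truncate_vocab(tokens, k):
    """Keep the k most frequent word types; map every other token to <unk>."""
    counts = Counter(tokens)
    keep = {word for word, _ in counts.most_common(k)}
    return [tok if tok in keep else UNK for tok in tokens]


def relative_reduction(ppl_word, ppl_char):
    """Relative perplexity reduction, as a fraction of the word-level value."""
    return (ppl_word - ppl_char) / ppl_word


# Toy corpus (illustrative only): "the" occurs 3 times, "cat" twice.
corpus = "the cat sat on the mat the cat".split()
truncated = truncate_vocab(corpus, k=2)
```

Because the character model treats <unk> as an ordinary token, the same truncated token stream can serve as the training data for both the word-level and the character-level model at a given vocabulary size.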

Further Observations 

We report on some further experiments and observations: 

• Combining word embeddings with the CharCNN’s output
to form a combined representation of a word (to be
used as input to the LSTM) resulted in slightly worse
performance (perplexity of 81 on PTB with a large model).
This was surprising, as improvements have been reported on
part-of-speech tagging (dos Santos and Zadrozny 2014) and
named entity recognition (dos Santos and Guimaraes
2015) by concatenating word embeddings with the output
from a character-level CNN. While this could be due
to insufficient experimentation on our part,* it suggests
that for some tasks, word embeddings are superfluous:
character inputs are good enough.

• While our model requires additional convolution operations
over characters and is thus slower than a comparable
word-level model, which can perform a simple lookup at
the input layer, we found that the difference was manageable
with optimized GPU implementations. For example,
on PTB the large character-level model trained at 1500
tokens/sec compared to the word-level model, which trained
at 3000 tokens/sec. For scoring, our model can have the
same running time as a pure word-level model, as the
CharCNN’s outputs can be pre-computed for all words in
V. This would, however, be at the expense of increased
model size, and thus a trade-off can be made between
run-time speed and memory (e.g. one could restrict the
pre-computation to the most frequent words).
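
The speed/memory trade-off at scoring time can be illustrated with a small cache. This is only a sketch under stated assumptions: `CachedWordEncoder`, `encode_fn` and `toy_encode` are hypothetical names, and `encode_fn` stands in for the CharCNN and highway stack, mapping a word to its vector.

```python
class CachedWordEncoder:
    """Serve word representations from a lookup table when possible."""

    def __init__(self, encode_fn, frequent_words):
        # encode_fn is a stand-in for the character-level encoder: str -> vector.
        self.encode_fn = encode_fn
        # Pre-compute once, so scoring these words is a plain lookup.
        self.table = {word: encode_fn(word) for word in frequent_words}

    def __call__(self, word):
        cached = self.table.get(word)
        if cached is not None:
            return cached
        # Word outside the table: compute its representation on the fly.
        return self.encode_fn(word)


def toy_encode(word):
    # Hypothetical encoder used only to make the example runnable.
    return [float(len(word)), float(word.count("e"))]


encoder = CachedWordEncoder(toy_encode, frequent_words=["the", "of"])
```

Growing `frequent_words` moves more of the cost from computation to memory. Caching the full vocabulary corresponds to the pure lookup case described above.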

Related Work 

Neural Language Models (NLM) encompass a rich family
of neural network architectures for language modeling.
Some example architectures include feed-forward (Bengio,
Ducharme, and Vincent 2003), recurrent (Mikolov et al.
2010), sum-product (Cheng et al. 2014), log-bilinear (Mnih
and Hinton 2007), and convolutional (Wang et al. 2015)
networks.

In order to address the rare word problem, Alexandrescu
and Kirchhoff (2006), building on analogous work on
count-based n-gram language models by Bilmes and Kirchhoff
(2003), represent a word as a set of shared factor
embeddings. Their Factored Neural Language Model (FNLM)
can incorporate morphemes, word shape information (e.g.
capitalization) or any other annotation (e.g. part-of-speech
tags) to represent words.

A specific class of FNLMs leverages morphemic information
by viewing a word as a function of its (learned)
morpheme embeddings (Luong, Socher, and Manning 2013;
Botha and Blunsom 2014; Qui et al. 2014). For example,
Luong, Socher, and Manning (2013) apply a recursive neural
network over morpheme embeddings to obtain the embedding
for a single word. While such models have proved useful,
they require morphological tagging as a preprocessing
step.

Another direction of work has involved purely character-level
NLMs, wherein both input and output are characters
(Sutskever, Martens, and Hinton 2011; Graves 2013).
Character-level models obviate the need for morphological
tagging or manual feature engineering, and have the attractive
property of being able to generate novel words. However,
they are generally outperformed by word-level models
(Mikolov et al. 2012).

Outside of language modeling, improvements have
been reported on part-of-speech tagging (dos Santos and
Zadrozny 2014) and named entity recognition (dos Santos
and Guimaraes 2015) by representing a word as a concatenation
of its word embedding and an output from a character-level
CNN, and using the combined representation as features
in a Conditional Random Field (CRF). Zhang, Zhao,
and LeCun (2015) do away with word embeddings completely
and show that for text classification, a deep CNN
over characters performs well. Ballesteros, Dyer, and Smith
(2015) use an RNN over characters only to train a transition-based
parser, obtaining improvements on many morphologically
rich languages.

*We experimented with (1) concatenation, (2) tensor products,
(3) averaging, and (4) adaptive weighting schemes whereby the
model learns a convex combination of word embeddings and the
CharCNN outputs.

Finally, Ling et al. (2015) apply a bi-directional LSTM 
over characters to use as inputs for language modeling and 
part-of-speech tagging. They show improvements on various 
languages (English, Portuguese, Catalan, German, Turkish). 
It remains open as to which character composition model 
(i.e. CNN or LSTM) performs better. 

Conclusion 

We have introduced a neural language model that utilizes
only character-level inputs. Predictions are still made at the
word-level. Despite having fewer parameters, our model
outperforms baseline models that utilize word/morpheme
embeddings in the input layer. Our work questions the
necessity of word embeddings (as inputs) for neural language
modeling.

Analysis of word representations obtained from the character
composition part of the model further indicates that
the model is able to encode, from characters only, rich
semantic and orthographic features. Using the CharCNN and
highway layers for representation learning (e.g. as input into
word2vec (Mikolov et al. 2013)) remains an avenue for
future work.

Insofar as sequential processing of words as inputs is
ubiquitous in natural language processing, it would be
interesting to see if the architecture introduced in this paper is
viable for other tasks, for example as an encoder/decoder
in neural machine translation (Cho et al. 2014; Sutskever,
Vinyals, and Le 2014).